A Statistical Approach to Mechanized Literature Searching

Communication of ideas is carried out on the basis of statistical probability in that an author chooses that level of specificity and that combination of words which will most probably assure him comprehension on the part of those he intends to reach. Since the execution of this process varies amongst authors and since similar ideas are therefore relayed at different levels of specificity and by means of different words, the problem of literature searching by machines still lacks a satisfactory solution. A statistical approach to this problem will be outlined and the various steps of a system based on this approach will be described. These steps include the statistical analysis of a collection of documents of a chosen field of interest, the establishment of a set of 'notions' and the vocabulary by which they are being expressed, the compilation of a thesaurus-type dictionary and index, the encoding of documents with the aid of the latter, the encoding of topological notations (such as branched structures), the recording of the coded information, the establishment of a searching pattern for finding pertinent information, and the programming of appropriate machines to carry out a search.
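The encoding and searching steps listed above can be illustrated with a minimal sketch. It is not the report's implementation: the thesaurus entries, the notion codes (`N1`, `N2`, `N3`), the function names, and the ranking rule (count of shared notions) are all assumptions made for illustration. In the system the abstract describes, the thesaurus would come from a statistical analysis of the document collection rather than being written by hand.

```python
from collections import Counter

# Hypothetical thesaurus-type dictionary: each vocabulary word maps to the
# code of the "notion" it expresses. These entries are invented for the example.
THESAURUS = {
    "machine": "N1", "computer": "N1", "automaton": "N1",
    "search": "N2", "searching": "N2", "retrieval": "N2",
    "document": "N3", "literature": "N3", "paper": "N3",
}

def tokenize(text):
    """Split on whitespace, strip common punctuation, and lowercase."""
    return [w.strip(".,;:()'\"").lower() for w in text.split()]

def encode(text):
    """Encode a text as notion code -> number of occurrences."""
    return Counter(THESAURUS[w] for w in tokenize(text) if w in THESAURUS)

def search(query, documents, min_shared=1):
    """Return document ids ranked by how many notions they share with the query.

    The shared-notion count is an assumed matching rule, chosen for simplicity.
    """
    query_notions = set(encode(query))
    scored = []
    for doc_id, text in documents.items():
        shared = len(query_notions & set(encode(text)))
        if shared >= min_shared:
            scored.append((shared, doc_id))
    # Highest overlap first; ties broken by document id for a stable order.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Because the query and the documents are both reduced to notion codes before comparison, a query phrased with "computer" and "retrieval" can match a document that says "machine" and "searching", which is the mismatch of vocabulary the abstract identifies as the core difficulty.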

By: H. P. Luhn

Published in: RC3 in 1957
